Search Results: "sesse"

19 October 2017

Steinar H. Gunderson: Introducing Narabu, part 2: Meet the GPU

Narabu is a new intraframe video codec. You may or may not want to read part 1 first. The GPU, despite being far more flexible than it was fifteen years ago, is still a very different beast from your CPU, and not all problems map well to it performance-wise. Thus, before designing a codec, it's useful to know what our platform looks like. A GPU has lots of special functionality for graphics (well, duh), but we'll be concentrating on the compute shader subset in this context, i.e., we won't be drawing any polygons. Roughly, a GPU (as I understand it!) is built up as follows:

- A GPU contains 1-20 cores; NVIDIA calls them SMs (streaming multiprocessors), Intel calls them subslices. (Trivia: a typical mid-range Intel GPU contains two cores, and is thus designated GT2.) One such core usually runs the same program, although on different data; there are exceptions, but typically, if your program can't fill an entire core with parallelism, you're wasting energy.
- Each core, in addition to tons (thousands!) of registers, also has some shared memory (sometimes called local memory, although that term is overloaded), typically 32-64 kB, which you can think of in two ways: either as a sort of explicit L1 cache, or as a way to communicate internally within a core. Shared memory is a limited, precious resource in many algorithms.
- Each core/SM/subslice contains about 8 execution units (Intel calls them EUs; NVIDIA and AMD call them something else) and some memory access logic. These multiplex a bunch of threads (say, 32) and run them in a round-robin-ish fashion. This means a GPU can handle memory stalls much better than a typical CPU, since it has so many streams to pick from; even though each thread runs in-order, it can just kick off an operation and then go to the next thread while the previous one is waiting.
- Each execution unit has a bunch of ALUs (typically 16) and executes code in a SIMD fashion. NVIDIA calls these ALUs "CUDA cores", AMD calls them "stream processors". Unlike on a CPU, this SIMD has full scatter/gather support (although sequential access, especially in certain patterns, is much more efficient than random access), lane enable/disable so it can work with conditional code, etc. The typically fastest operation is a 32-bit float muladd; usually that's single-cycle. GPUs love 32-bit FP code. (In fact, in some GPU languages, you won't even have 8-, 16- or 64-bit types. This is annoying, but not the end of the world.)

The vectorization is not exposed to the user in typical code (GLSL has some vector types, but they're usually just broken up into scalars, so that's a red herring), although in some programming languages you can swizzle the SIMD lanes directly to take advantage of it (there are also schemes for broadcasting bits by voting etc.). However, it is crucially important to performance: if you have divergence within a warp, the GPU needs to execute both sides of the if, so less divergent code is good. Such a SIMD group is called a warp by NVIDIA (I don't know if the others have names for it). NVIDIA's SIMD/warp width is always 32; AMD used to be 64 but is now 16; Intel supports 4-32 (the compiler will autoselect based on a bunch of factors), although 16 is the most common.

The upshot of all of this is that you need massive amounts of parallelism to be able to get useful performance out of a GPU.
A rule of thumb is that if you could have launched about a thousand threads for your problem on a CPU, it's a good fit for a GPU, although this is of course just a guideline. There are a ton of APIs available for writing compute shaders. There's CUDA (NVIDIA-only, but the dominant player), D3D compute (Windows-only, but multi-vendor), OpenCL (multi-vendor, but highly variable implementation quality), OpenGL compute shaders (all platforms except macOS, which has too-old drivers), Metal (Apple-only) and probably some that I forgot. I've chosen to go for OpenGL compute shaders, since I already use OpenGL shaders a lot, and this saves on interop issues. CUDA is probably more mature, but my laptop is Intel. :-) No matter which one you choose, the programming model looks very roughly like this pseudocode:
for (size_t workgroup_idx = 0; workgroup_idx < NUM_WORKGROUPS; ++workgroup_idx) {  // in parallel over cores
        char shared_mem[REQUESTED_SHARED_MEM];  // private for each workgroup
        for (size_t local_idx = 0; local_idx < WORKGROUP_SIZE; ++local_idx) {  // in parallel on each core
                main(workgroup_idx, local_idx, shared_mem);
        }
}
except that in reality, the indices will be split in x/y/z for your convenience (you control all six dimensions, of course), and if you haven't asked for too much shared memory, the driver can silently make larger workgroups if it helps increase parallelism (this is totally transparent to you). main() doesn't return anything, but you can do reads and writes as you wish; GPUs have large amounts of memory these days, and staggering amounts of memory bandwidth.

Now for the bad part: generally, you will have no debuggers, no way of logging, and no real profilers (if you're lucky, you can get to know how long each compute shader invocation takes, but not what takes time within the shader itself). Especially the latter is maddening; the only real recourse you have is some timers, and then placing timer probes or trying to comment out sections of your code to see if something goes faster. If you don't get the answers you're looking for that way, forget printf: you need to set up a separate buffer, write some numbers into it, and pull that buffer back from the GPU. Profilers are an essential part of optimization, and I had really hoped the world would be more mature here by now. Even CUDA doesn't give you all that much insight; sometimes I wonder if all of this is because GPU drivers and architectures are meant to be shrouded in mystery for competitiveness reasons, but I'm honestly not sure.

So that's it for a crash course in GPU architecture. Next time, we'll start looking at the Narabu codec itself.
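As an appendix, here is a minimal sketch of what that model looks like in actual GLSL plus the host-side dispatch. This is my own illustration, not Narabu code; the buffer layout, the sizes and the toy "reverse each block" kernel are all made up, and shader compilation/context setup is omitted.

#include <epoxy/gl.h>  // or your GL loader of choice

// The shader mirrors the pseudocode: gl_WorkGroupID is workgroup_idx,
// gl_LocalInvocationID is local_idx, and "shared" is the per-workgroup memory.
static const char *kComputeSrc = R"(#version 430
layout(local_size_x = 64) in;                      // WORKGROUP_SIZE
layout(std430, binding = 0) buffer Data { float values[]; };
shared float shared_mem[64];                       // the requested shared memory

void main() {
        uint i = gl_GlobalInvocationID.x;          // workgroup_idx * 64 + local_idx
        shared_mem[gl_LocalInvocationID.x] = values[i];
        barrier();                                 // sync the whole workgroup
        // Toy example: reverse each 64-element block through shared memory.
        values[i] = shared_mem[63u - gl_LocalInvocationID.x];
}
)";

// Host side: bind the buffer, dispatch NUM_WORKGROUPS groups, wait for the writes.
void run(GLuint program, GLuint ssbo, GLuint num_workgroups)
{
        glUseProgram(program);  // program compiled and linked from kComputeSrc
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
        glDispatchCompute(num_workgroups, 1, 1);   // only using the x dimension
        glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}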

18 October 2017

Steinar H. Gunderson: Introducing Narabu, part 1: Introduction

Narabu is a new intraframe video codec, from the Japanese verb narabu (並ぶ), which means to line up or be parallel. Let me first state straight up that Narabu isn't where I had hoped it would be at this stage; the encoder isn't fast enough, and I have to turn my attention to other projects for a while. Nevertheless, I think it is interesting as a research project in its own right, and I don't think that should stop me from trying to write up a small series. :-)

In the spirit of Leslie Lamport, I'll start off by describing what problem I was trying to solve, which will hopefully make the design decisions a lot clearer. Subsequent posts will dive into background information and then finally Narabu itself.

I want a codec to send signals between different instances of Nageru, my free software video mixer, and also, longer-term, between other software, such as recording or playout. The reason is pretty obvious for any sort of complex configuration: if you are doing e.g. both a stream mix and a bigscreen mix, they will naturally want to use many of the same sources, and sharing them over a single GigE connection might be easier than getting SDI repeaters/splitters, especially when you have a lot of them. (Also, in some cases, you might want to share synthetic signals, such as graphics, that never existed on SDI in the first place.) This naturally leads to a number of demands on the codec.

There's a bunch of intraframe formats around. The most obvious thing to do would be to use Intel Quick Sync to produce H.264 (intraframe H.264 blows basically everything else out of the sky in terms of PSNR, and QSV hardly uses any power at all), but sadly, that's limited to 4:2:0. I thought about encoding the three color planes as three different monochrome streams, but monochrome is not supported either. Then there's a host of software solutions. x264 can do 4:2:2, but even on ultrafast, it gobbles up an entire core or more at 720p60 at the target bitrates (mostly in entropy coding). FFmpeg has implementations of all kinds of other codecs, like DNxHD, CineForm, MJPEG and so on, but they all use much more CPU for encoding than the target. NDI would seem to fit the bill exactly, but fails the licensing check, and also isn't robust to corrupted or malicious data. (That, and their claims about video quality are dramatically overblown for any kind of real video data I've tried.)

So, sadly, this leaves really only one choice, namely rolling my own. I quickly figured I couldn't beat the world on CPU video codec speed, and didn't really want to spend my life optimizing AVX2 DCTs anyway, so again, the GPU will come to our rescue in the form of compute shaders. (There are some other GPU codecs out there, but all that I've found depend on CUDA, so they are NVIDIA-only, which I'm not prepared to commit to.) Of course, the GPU is quite busy in Nageru, but if one can make an efficient enough codec that one stream takes only 5% or so of the GPU (meaning 1200 fps or so), it wouldn't really make a dent. (As a spoiler, the current Narabu encoder isn't there for 720p60 on my GTX 950, but the decoder is.)

In the next post, we'll look a bit at the GPU programming model, and what it means for how our codec needs to look at the design level.

11 September 2017

Steinar H. Gunderson: rANS encoding of signed coefficients

I'm currently trying to make sense of some still image coding (more details to come at a much later stage!), and for a variety of reasons, I've chosen to use rANS as the entropy coder. However, there's an interesting little detail that I haven't actually seen covered anywhere; maybe it's just because I've missed something, or maybe because it's too blindingly obvious, but I thought I would document what I ended up with anyway. (I had hoped for something even more elegant, but I guess the obvious will have to do.)

For those who don't know rANS coding, let me try to handwave it as much as possible. Your state is typically a single word (in my case, a 32-bit word), which is refilled from the input stream as needed. The encoder and decoder work in reverse order; let's just talk about the decoder. Basically it works by looking at the lowest 12 (or whatever) bits of the decoder state and mapping each of those 2^12 slots to a decoded symbol. More common symbols are given more slots, proportionally to their frequency. Let me just write a tiny, tiny example with 2 bits and three symbols instead, giving four slots:
Lowest bits   Symbol
00            0
01            0
10            1
11            2
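In code, one decode step against this table could look roughly like the sketch below (my own illustration with ryg-style byte-wise renormalization, not Narabu's actual code; a real implementation would use larger tables and a few more tricks):

#include <stdint.h>

// The 2-bit table above: symbol 0 owns slots 0-1, symbol 1 owns slot 2,
// symbol 2 owns slot 3.
static const uint8_t slot_to_sym[4] = { 0, 0, 1, 2 };
static const uint8_t sym_start[3]   = { 0, 2, 3 };  // first slot of each symbol
static const uint8_t sym_freq[3]    = { 2, 1, 1 };  // number of slots of each

uint8_t rans_decode_step(uint32_t *x, const uint8_t **in)
{
        uint32_t slot = *x & 3;  // the lowest 2 bits select the slot
        uint8_t sym = slot_to_sym[slot];
        // The state update recovers the encoder's previous state; the offset
        // within the symbol's slot range is the "stashed information".
        *x = sym_freq[sym] * (*x >> 2) + slot - sym_start[sym];
        while (*x < (1u << 23))             // refill to keep the state in range
                *x = (*x << 8) | *(*in)++;  // (the encoder emitted these bytes in reverse)
        return sym;
}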
Note that the zero coefficient here maps to one out of two slots (i.e., a range); you don't choose which one yourself, the encoder stashes some information in there (which is used to recover the next control word once you know which symbol it is).

Now for the actual problem: when storing DCT coefficients, we typically want to also store a sign (i.e., not just 1 or 2, but also -1/+1 and -2/+2). The statistical distribution is symmetrical, so the sign bit is incompressible (except that of course there's no sign bit needed for 0). We could have done this by introducing new symbols -1 and -2 in addition to our three other ones, but this means we'd need more bits of precision, and accordingly larger look-up tables (which is bad for performance). So let's find something better.

We could also simply store the sign separately somehow; if the coefficient is non-zero, store the bit in some separate repository. Perhaps more elegantly, you can encode a second symbol in the rANS stream with probability 1/2, but this is more expensive computationally. Both of these have the problem that they're divergent in terms of control flow, though; nonzero coefficients potentially need to do a lot of extra computation and even loads. This isn't nice for SIMD, and it's not nice for GPU. It's generally not really nice.

The solution I ended up with was simulating a larger table with a smaller one. Simply rotate the table so that the zero symbol has the top slots instead of the bottom slots, and then replicate the rest of the table. For instance, take this new table:
Lowest bits   Symbol
000           1
001           2
010           0
011           0
100           0
101           0
110           -1
111           -2
(The observant reader will note that this doesn't describe the exact same distribution as last time; zero has twice the relative frequency it had in the other table. But ignore that for the time being.) In this case, the bottom half of the table doesn't actually need to be stored! We know that if the three bottom bits are >= 110 (6 in decimal), we have a negative value; we can subtract 6 and then continue decoding. If we go past the end of our 2-bit table despite that, we know we are decoding a zero coefficient (which doesn't have a sign), so we can just clamp the read; on a GPU, out-of-bounds reads on a texture will typically return 0 anyway. So it all works nicely, and the divergent I/O is gone.

If this piqued your interest, you probably want to read up on rANS in general; Fabian Giesen (aka ryg) has some notes that work as a good starting point, but beware: some of it is pretty confusing. :-)
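Extending the earlier sketch to this rotated, replicated table (again my own illustration, not Narabu's actual code): the negative half is folded back onto the stored positive half, and reads past the end of the stored table are clamped, which yields the zero symbol, just like an out-of-bounds texture read returning 0 on a GPU.

#include <stdint.h>

// Only the positive half of the 3-bit table is stored (4 entries).
static const uint8_t slot_to_abs[4] = { 1, 2, 0, 0 };  // |symbol| per stored slot
static const uint8_t abs_start[3]   = { 2, 0, 1 };     // first slot of 0, +1, +2
static const uint8_t abs_freq[3]    = { 4, 1, 1 };     // slot count of 0, +1, +2

int decode_signed(uint32_t *x, const uint8_t **in)
{
        uint32_t slot = *x & 7;                    // lowest 3 bits of the state
        int sign = 1;
        if (slot >= 6) {                           // the replicated negative half:
                slot -= 6;                         // fold back onto the stored table
                sign = -1;
        }
        uint32_t a = slot_to_abs[slot > 3 ? 3 : slot];  // clamped read => zero
        // Standard rANS state update; the folding above already adjusted slot,
        // so the positive-half start values work for the negative symbols too.
        *x = abs_freq[a] * (*x >> 3) + slot - abs_start[a];
        while (*x < (1u << 23))                    // byte-wise renormalization
                *x = (*x << 8) | *(*in)++;
        return sign * (int)a;
}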

7 September 2017

Steinar H. Gunderson: Licensing woes

On releasing modified versions of GPLv3 software in binary form only (quote anonymized):
And in my opinion it's perfectly ok to give out a binary release of a project, that is a work in progress, so that people can try it out and coment on it. It's easier for them to have it as binary and not need to compile it themselfs. If then after a (long) while the code is still only released in binary form, then it's ok to start a discussion. But only for a quick test, that is unneccessary. So people, calm down and enjoy life!
I wonder at what point we got here.

20 August 2017

Steinar H. Gunderson: Random codec notes

Post-Solskogen, there haven't been all that many commits in the main Nageru repository, but that doesn't mean the project is standing still. In particular, I've been working with NVIDIA to shake out a crash bug in their drivers (which in itself uncovered some interesting debugging techniques, although in the end, the bug turned out to be found by the boring standard technique of analyzing crash dumps and writing a minimal program to reproduce it). But I've also been looking at intraframe codecs; my sort-of plan was to go to VideoLAN Dev Days to present my findings, but unfortunately, there seems to be a schedule conflict, so instead, you can have some scattered random notes. Work in progress :-) Maybe something more coherent will come out eventually. Edit: Forgot about TF switching!

5 August 2017

Steinar H. Gunderson: Dear conference organizers

Dear conference organizers, In this day and age, people stream conferences and other events over the Internet. Most of the Internet happens to be in a different timezone from yours (it's crazy, I know!). This means that if you publish a schedule, please say which timezone it's in. We've even got this thing called JavaScript now, which allows you to also convert times to the user's local timezone (the future is now!), so you might want to consider using it. :-) (Yes, this goes for you, DebConf, and also for you, Assembly.)

17 July 2017

Steinar H. Gunderson: Solskogen 2017: Nageru all the things

Solskogen 2017 is over! What a blast that was; I especially enjoyed that so many old-timers came back to visit, it really made the party for me. This was the first year we were using Nageru not only for the stream but also for the bigscreen mix, and I was very relieved to see the lack of problems; I've had nightmares about crashes with 150+ people watching (plus 200-ish more on the stream), but there were no crashes and hardly a dropped frame. The transition to a real mixing solution, as well as from HDMI to SDI everywhere, gave us a lot of new opportunities, which allowed a number of creative setups, some of them cobbled together on the spot. It's been a lot of fun, but also a lot of work. And work will continue for an even better show next year, after some sleep. :-)

9 July 2017

Steinar H. Gunderson: Nageru 1.6.1 released

I've released version 1.6.1 of Nageru, my live video mixer. Now that Solskogen is coming up, there's been a lot of activity on the Nageru front, but hopefully everything is actually coming together now. Testing has been good, but we'll see whether it stands up to the battle-hardening of the real world or not. Hopefully I won't be needing any last-minute patches. :-)

Besides the previously promised Prometheus metrics (1.6.1 ships with a rather extensive set, as well as an example Grafana dashboard) and frame queue management improvements, a surprising late addition was a new transcoder called Kaeru (following the naming style of Nageru itself, from the Japanese verb kaeru (換える), which means roughly to replace or exchange; iKnow! claims it can also mean "convert", but I haven't seen support for this anywhere else).

Normally, when I do streams, I just let Nageru do its thing and send out a single 720p60 stream (occasionally 1080p), usually around 5 Mbit/sec; less than that doesn't really give good enough quality for the high-movement scenarios I'm after. But Solskogen is different in that there's a pretty diverse audience when it comes to networking conditions; even though I have a few mirrors spread around the world (and some JavaScript to automatically pick the fastest one; DNS round-robin is really quite useless here!), not all viewers can sustain such a bitrate. Thus, there's also a 480p variant at around 1 Mbit/sec or so, and it needs to come from somewhere.

Traditionally, I've been using VLC for this, but streaming is really a niche thing for VLC. I've been told it will be an increased focus for 4.0 now that 3.0 is getting out the door, but over the last few years, there's been a constant trickle of little issues that have been breaking my transcoding pipeline. My solution for this was to simply never update VLC, but now that I've moved to stretch, this didn't really work anymore, and I'd been toying with the idea of making a standalone transcoder for a while. (You'd ask "why not the ffmpeg(1) command-line client?", but it's a bit too centered around files and not streams; I use it for converting to HLS for iOS devices, but it has a nasty habit of letting I/O block real work, and its HTTP server really isn't meant for production use. I could survive the latter if it supported Metacube and I could feed it into Cubemap, but it doesn't.)

It turned out Nageru had already grown most of the pieces I needed: it had video decoding through FFmpeg, x264 encoding with speed control (so that it automatically picks the best preset the machine can sustain at any given time) and muxing, audio encoding, proper threading everywhere, and a usable HTTP server that could output Metacube. All that was required was to add audio decoding to the FFmpeg input, and then replace the GPU-based mixer and GUI with a very simple driver that just connects the decoders to the encoders. (This means it runs fine on a headless server with no GPU, but it also means you'll get FFmpeg's scaling, which isn't as pretty or as fast as Nageru's. I think it's an okay tradeoff.) All in all, this was only about 250 lines of delta, which pales in comparison to the ~28000 lines of delta between 1.3.1 (used for the last Solskogen) and 1.6.1. It only supports a rather limited set of Prometheus metrics, and it has some limitations, but it seems to be stable and to deliver pretty good quality. I've marked it as experimental for now, but overall, I'm quite happy with how it turned out, and I'll be using it for Solskogen.
Nageru 1.6.1 is on its way into Debian, but it depends on a new version of Movit which needs to go through the NEW queue (a soname bump), so it might be a few days. In the meantime, I'll be busy preparing for Solskogen. :-)

25 June 2017

Steinar H. Gunderson: Frame queue management in Nageru 1.6.1

Nageru 1.6.1 is on its way, and what was intended to be a release centered only around monitoring improvements (more specifically, a full set of native Prometheus metrics) actually ended up getting a fairly substantial change to how Nageru manages its frame queues. To understand what's changing and why, it's useful to first understand the history of Nageru's queue management.

Nageru 1.0.0 started out with a fairly simple scheme, but with some basics that are still relevant today: one of the input cards is deemed the master card, and whenever it delivers a frame, the master clock ticks and an output frame is produced. (There are some subtleties about dropped frames and/or the master card changing frame rates, but I'm going to ignore them, since they're not important to the discussion.) To this end, every card keeps a preallocated frame queue; when a card delivers a frame, it's put into the queue, and when the master clock ticks, it tries picking out one frame from each of the other cards' queues to mix together. Note that "mix" here could be as simple as picking one input and throwing all the other ones away; the queueing algorithm doesn't care, it just feeds all of them to the theme and lets it run whatever GPU code it needs to match the user's preferences. The only thing that really keeps the queues bounded is that the frames in them are preallocated (in GPU memory), so if one queue gets longer than 16 frames, Nageru starts dropping frames from it.

But is 16 the right number? There are two conflicting demands here, ignoring memory usage: the queues should be long enough never to underrun, but short enough to keep latency down. The 1.0.0 scheme does about as well as one could possibly hope in never dropping frames, but unfortunately, it can be pretty poor at latency. For instance, if your master card runs at 50 Hz and you have a 60 Hz card, the latter will eventually build up a delay of 16 * 16.7 ms = 266.7 ms, which is clearly unacceptable, and rather unneeded. You could ask the user to specify a queue length, but the user probably doesn't know, and also shouldn't really have to care; more knobs to twiddle is a bad thing, and even more so knobs the user is expected to twiddle.

Thus, Nageru 1.2.0 introduced queue autotuning; it keeps a running estimate of how big the queue needs to be to avoid underruns, simply based on experience. If we've been dropping frames on a queue and then there's an underrun, the safe queue length is increased by one, and if the queue has had excess frames for more than a thousand successive master clock ticks, we reduce it by one again. Whenever the queue has more than this safe number, we drop frames. This was simple, effective, and largely fixed the problem.

However, when adding metrics, I noticed a peculiar effect: not all of my devices have equally good clocks. In particular, when setting up for 1080p50, my output card's internal clock (which assumes the role of the master clock when using HDMI/SDI output) seems to tick at about 49.9998 Hz, and my simple home camcorder delivers frames at about 49.9995 Hz. Over the course of an hour, this means one produces one more frame than the other, which should of course be dropped. Having an SDI setup with synchronized clocks (blackburst/tri-level) would of course fix this problem, but most people are not so lucky with their cameras, not to mention the price of PC graphics cards with SDI outputs! However, this divergence happens very slowly, which means that for a significant amount of time, the two clocks will be very nearly in sync, and thus racing.
Who ticks first is determined largely by luck in the jitter (normal is maybe 1 ms, but occasionally, you'll see delayed delivery of as much as 10 ms), and this means that the 1000-tick estimate is likely to be thrown off, and the result is hundreds of dropped frames and underruns in that period. Once the clocks have diverged enough again, you're off the hook; but again, this isn't a good place to be.

Thus, Nageru 1.6.1 changes the algorithm around yet again, by incorporating more data to build an explicit jitter model. 1.5.0 was already timestamping each frame to be able to measure end-to-end latency precisely (now also exposed in Prometheus metrics), but from 1.6.1, these timestamps are actually used in the queueing algorithm. I ran several eight- to twelve-hour tests, simply storing all the event arrivals to a file, and then simulated a few different algorithms (including the old one) to see how they fared in measures such as latency and number of drops/underruns.

I won't go into the full details of the new queueing algorithm (see the commit if you're interested), but the gist is: based on the last 5000 frames, it tries to estimate the maximum possible jitter for each input (i.e., how late a frame could possibly be). Based on this, as well as clock offsets, it determines whether it's really sure that there will be an input frame available on the next master tick even if it drops the queue, and then trims the queue to fit.

The result is pretty satisfying; here's the end-to-end latency of my camera being sent through to the SDI output: as you can see, the latency goes up, up, up until Nageru figures it's now safe to drop a frame, and then does it in one clean drop event; no more hundreds of drops involved. There are very late frame arrivals involved in this run (two extra frame drops, to be precise), but the algorithm simply determines immediately that they are outliers, and drops them without letting them linger in the queue. (Immediate dropping is usually preferred to sticking around for a bit and then dropping later, as it means you only get one disturbance event in your stream as opposed to two. Of course, you can only do it if you're reasonably sure it won't lead to more underruns later.)

Nageru 1.6.1 will ship before Solskogen, as I intend to run it there. :-) And there will probably be lovely premade Grafana dashboards from the Prometheus data. Although it would have been a lot nicer if Grafana were more packaging-friendly, so I could pick it up from stock Debian and run it on armhf. Hrmf. :-)
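For reference, here is a minimal sketch of the 1.2.0-era autotuning heuristic described above. This is my own reconstruction from the description, not Nageru's actual code; in particular, exactly what counts as "excess frames" is my assumption.

// Hypothetical reconstruction of the 1.2.0-style autotuning; called once per
// master clock tick for each non-master input queue.
struct QueueAutotuner {
        unsigned safe_len = 1;           // current estimate of the needed queue length
        unsigned ticks_with_excess = 0;  // successive ticks with frames to spare

        // An underrun right after we have been dropping frames means we were
        // too aggressive: allow the queue to grow one frame longer.
        void on_underrun_after_drops() { ++safe_len; ticks_with_excess = 0; }

        // Returns how many frames to drop from the front of this queue.
        unsigned frames_to_drop(unsigned queue_len) {
                if (queue_len > safe_len) {
                        // More frames than we believe we need (assumed "excess").
                        if (++ticks_with_excess > 1000 && safe_len > 1) {
                                --safe_len;  // consistently too much slack
                                ticks_with_excess = 0;
                        }
                        return queue_len - safe_len;
                }
                ticks_with_excess = 0;
                return 0;
        }
};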

18 June 2017

Eriberto Mota: How to migrate from Debian Jessie to Stretch

Welcome to Debian Stretch! Yesterday, 17 June 2017, Debian 9 (Stretch) was released. I would like to go over some basic procedures and rules for migrating from Debian 8 (Jessie).

Initial steps
# apt-get update
# apt-get dist-upgrade
Migrating
deb http://ftp.br.debian.org/debian/ stretch main
deb-src http://ftp.br.debian.org/debian/ stretch main
   
deb http://security.debian.org/ stretch/updates main
deb-src http://security.debian.org/ stretch/updates main
# apt-get update
# apt-get dist-upgrade
If any problem comes up, read the error messages and try to solve it. Whether or not you manage to solve the problem, run the command again:
# apt-get dist-upgrade
If new problems appear, try to solve them. Search Google for solutions if necessary. But usually everything will go fine and you should not have any problems.

Changes to configuration files

While you are migrating, some messages about changes to configuration files may be shown. This can leave some users lost, not knowing what to do. Don't panic. These messages come in two forms: as plain text in a shell, or as a blue message window. The following text is an example of a message in a shell:
Configuration file '/etc/rsyslog.conf'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** rsyslog.conf (Y/I/N/O/D/Z) [default=N] ?
The following screen is an example of a message shown via a window. In both cases, it is recommended that you choose to install the new version of the configuration file. This is because the new configuration file will be fully adapted to the newly installed services and may have many new or different options. But don't worry: your settings will not be lost, since there will be a backup of them. So, in the shell case, choose option "Y", and in the window case, choose the option "install the package maintainer's version". It is very important to write down the name of each modified file. In the case of the window above, it is /etc/samba/smb.conf. In the shell case, the file was /etc/rsyslog.conf. After completing the migration, you will be able to compare the new configuration file with the original one. If the new file was installed after a choice made in the shell, the original file (the one you had before) will keep the same name with the extension .dpkg-old. If the choice was made via the window, the old file will be kept with the extension .ucf-old. In both cases, you can review the changes that were made and adjust your new file according to your needs. If you need help spotting the differences between the files, you can use the diff command to compare them. Always diff from the new file to the original, as if you wanted to see what to change in the new file to make it equal to the original. Example:
# diff -Naur /etc/rsyslog.conf /etc/rsyslog.conf.dpkg-old
At first glance, the lines marked with "+" would have to be added to the new file to make it match the old one, and the ones marked with "-" would have to be removed. But be careful: it is normal for some lines to differ, since the configuration file was made for a new version of the service or application it belongs to. So, change only the lines that are really necessary, the ones you had changed in the previous file. See the example:
+daemon.*;mail.*;\
+ news.err;\
+ *.=debug;*.=info;\
+ *.=notice;*.=warn  /dev/xconsole
+*.* @sam
In my case, I had originally only changed the last line. So, in the new configuration file, I am only interested in adding that line. Well, if you were the one who made the previous configuration, you will know the right thing to do. Usually, there will not be many differences between the files. Another option for viewing the differences between files is the mcdiff command, provided by the mc package. Example:
# mcdiff /etc/rsyslog.conf /etc/rsyslog.conf.dpkg-old
Problems with graphical environments and applications

You may have problems with graphical environments such as GNOME, KDE etc., or with applications such as Mozilla Firefox. In these cases, the problem probably lies in those components' configuration files in the user's home directory. To check, create a new user on the Debian system and test with it. If everything works, make a backup of the old configuration files (or rename them) and let the application create new ones. For example, for Mozilla Firefox, go to the user's home directory and, with Firefox closed, rename the .mozilla directory to .mozilla.bak, then start Firefox and test. Feeling unsure? If you are feeling very unsure, install Debian 8, with a graphical environment and other things, in a virtual machine and migrate it to Debian 9 to test and learn. I suggest VirtualBox as the virtualizer. Have fun!

30 May 2017

Steinar H. Gunderson: Nageru 1.6.0 released

I've released version 1.6.0 of Nageru, my live video mixer, together with the dependent libraries Movit 1.5.1 and bmusb 0.7.0. The primary new feature this time is integration with CasparCG, the dominating open-source broadcast graphics system, which opens up a whole new world of possibilities for intelligent overlay graphics. (Actually, the feature is a bit more generic than that; any FFmpeg file or stream will do as input. Audio isn't supported yet, though.) You can see a simple HTML5 CasparCG setup in the ultimate tournament stream test we did in April, in preparation for a larger event in September; CasparCG generates a stream with alpha, which is then fed into Nageru and used on top of the three camera sources. Apart from that, there's a new frame analyzer that helps with calibrating your signal chain; there are lots of devices that will happily mess with your signal, and measuring is the first step in counteracting that. (There are also a few input interpretation tweaks that will help with the most common issues.) 1.6.0 is on its way to Debian experimental, along with its dependencies (stretch will release with Nageru 1.4.2); there are likely to be backports when stretch releases and the backport queue opens up.

26 May 2017

Steinar H. Gunderson: Last minute stretch bugs

This last week, I found no fewer than three pet bugs that I hope will be allowed to go in before the stretch release. I promise, none of these were found late because I upgraded to stretch too late; it was just a perfect storm. :-)

27 April 2017

Steinar H. Gunderson: Chinese HDMI-to-SDI converters, part II

Following up on my previous post, I have only a few small nuggets of extra information: The power draw appears to be 230-240 mA at 5 V, or about 1.15 W. (It doesn't matter whether they have a signal or not.) This means you can power them off of a regular USB power bank; in a recent test, we used Deltaco 20000 mAh (at 3.7 V) power banks, which are supposed to power a GoPro plus such a converter for somewhere between 8 and 12 hours (we haven't measured exactly). It worked brilliantly, and solved the headache of how to get AC to the camera and converter; just slap on a power bank instead, and all you need to run is SDI. Curiously enough, 230 mA is so little that the power bank thinks it doesn't count as load, and just turns itself off after ten seconds or so. However, with the GoPro also active, it stays on all the time. At least it ran for the two hours that we needed without a hitch. The SDI converters don't accept USB directly, but you can purchase dumb USB-to-5.5x2.1mm cables cheaply from wherever, which works fine even though USB is supposed to give you only 100 mA without handshaking. Some eBay sellers even seem to include them with the converters. I guess the power bank just doesn't care; it's spec'd to 2.1 A on two ports and 1 A on the last one. Update: Sebastian Reichel pointed out that USB 2.0 ups the limit to 500 mA, so you should be within spec in any case. :-) And here's a picture of the entire contraption as a bonus:

Power bank and HDMI-to-SDI converter
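To put those numbers together (my arithmetic, with round figures, ignoring conversion losses): 20000 mAh at 3.7 V is 74 Wh, so the converter alone at about 1.15 W could run for roughly 74 / 1.15 = 64 hours; the quoted 8-12 hours for the whole rig is thus dominated by the GoPro's draw rather than the converter's.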

18 April 2017

Steinar H. Gunderson: Chinese HDMI-to-SDI converters

I often need to convert signals from HDMI to SDI (and occasionally back). This requires a box of some sort, and eBay obliges; there's a bunch of different sellers of the same devices, selling them at around $20-25. They don't seem to have a brand name, but they are invariably sold as 3G-SDI converters (meaning they should go up to 1080p60) and all share the same generic look. There are also corresponding SDI-to-HDMI converters that look pretty much the same, except they convert the other way. (They're easy to confuse, but that's not a problem unique to them.) I've used them for a while now, and there are pros and cons. They seem reliable enough, and they're a quarter of the price of e.g. Blackmagic's Micro converters, which is a real bargain. However, there are also some issues. The last issue is by far the worst, but it only affects 3G-SDI resolutions; 720p60, 1080p30 and 1080i60 all work fine. And to be fair, not even Blackmagic's own converters actually send 352M correctly most of the time. I wish there were a way I could publish this somewhere people would actually read it before buying these things, but without a name, it's hard for people to find it. They're great value for money, and I wouldn't hesitate to recommend them for almost all use; but then, there's that "almost". :-)

5 April 2017

Steinar H. Gunderson: Nageru 1.5.0 released

I just released version 1.5.0 of Nageru, my live video mixer. The biggest feature is obviously the HDMI/SDI live output, but there are lots of small nuggets everywhere; it's been four months in the making. I'll simply paste the NEWS entry here:
Nageru 1.5.0, April 5th, 2017
  - Support for low-latency HDMI/SDI output in addition to (or instead of) the
    stream. This currently only works with DeckLink cards, not bmusb. See the
    manual for more information.
  - Support changing the resolution from the command line, instead of locking
    everything to 1280x720.
  - The A/V sync code has been rewritten to be more in line with Fons
    Adriaensen's original paper. It handles several cases much better,
    in particular when trying to match 59.94 and 60 Hz sources to each other.
    However, it might occasionally need a few extra seconds on startup to
    lock properly if startup is slow.
  - Add support for using x264 for the disk recording. This makes it possible,
    among other things, to run Nageru on a machine entirely without VA-API
    support.
  - Support for 10-bit Y'CbCr, both on input and output. (Output requires
    x264 disk recording, as Quick Sync Video does not support 10-bit H.264.)
    This requires compute shader support, and is in general a little bit
    slower on input and output, due to the extra amount of data being shuffled
    around. Intermediate precision is 16-bit floating-point or better,
    as before.
  - Enable input mode autodetection for DeckLink cards that support it.
    (bmusb mode has always been autodetected.)
  - Add functionality to add a time code to the stream; useful for debugging
    latency.
  - The live display is now both more performant and of higher image quality.
  - Fix a long-standing issue where the preview displays would be too bright
    when using an NVIDIA GPU. (This did not affect the finished stream.)
  - Many other bugfixes and small improvements.
1.5.0 is on its way into Debian experimental (it's too late for the stretch release, especially as it also depends on Movit and bmusb from experimental), or you can get it from the home page as always.

4 April 2017

Matthias Klumpp: On Tanglu

It's time for a long-overdue blog post about the status of Tanglu. Tanglu is a Debian derivative, started in early 2013 when the systemd debate at Debian was still hot. It was formed by a few people wanting to create a Debian derivative for workstations with a time-based release schedule, using and showcasing new technologies (which include systemd, but also bundling systems and other things), built in the open with a community using infrastructure similar to Debian's. Tanglu is designed explicitly to complement Debian and not to compete with it on all devices.

Tanglu has achieved a lot of great things. We were the first Debian derivative to adopt systemd, and with the help of our contributors we could kill a few nasty issues affecting it and Debian before it ended up becoming the default in Debian Jessie. We also started to use the Calamares installer relatively early, bringing a modern installation experience in addition to the traditional debian-installer. We performed the usrmerge early, uncovering a few more issues which were fed back into Debian to be resolved (while workarounds were added to Tanglu). We also briefly explored switching from initramfs-tools to Dracut, but this release goal was dropped due to issues (though it might be revived later). A lot of other less impactful changes happened as well, borrowing a lot of useful ideas and code from Ubuntu (kudos to them!).

On the infrastructure side, we set up the Debian Archive Kit (dak), managing to find a couple of issues (mostly hardcoded assumptions about Debian) and reporting them back to make using dak easier for distributions which aren't Debian. We explored using fedmsg for our infrastructure, went through a long and painful iteration of build systems (buildbot -> Jenkins -> Debile) before finally ending up with Debile, and added a set of our own custom tools to collect archive QA information and present it to our developers in an easy-to-digest way. Except for wanna-build, Tanglu is hosting an almost-complete clone of the basic Debian archive management tools.

During the past year, however, the project's progress slowed down significantly. For this, mostly I am to blame. One of the biggest challenges for a young project is to attract new developers and members and keep them engaged. A lot of the people coming to Tanglu who were interested in contributing were unfortunately not packagers and sometimes not developers, and we didn't have the manpower to individually mentor these people and teach them the necessary skills. People asking for tasks were usually asked where their interests lay and what they would like to do, in order to give them a useful task. This sounds great in principle, but in practice it is actually not very helpful; a curated list of junior jobs is a much better starting point. We also invested almost zero time in making our project known and creating the necessary buzz and excitement that's actually needed to sustain a project like this. Doing more in the advertisement domain and "help newcomers" area is a high-priority issue in the Tanglu bugtracker, which to this day is still open. Doing good alone isn't enough; talking about it is of crucial importance, and that is something I knew about but didn't realize the impact of for quite a while. As strange as it sounds, investing in the tech only isn't enough; community building is of equal importance.

Regardless of that, Tanglu has members working on the project, but way too few to manage a project of this magnitude (getting package transitions migrated alone is a large task requiring quite some time, while at the same time being incredibly boring :P). A lot of our current developers can only invest small amounts of time into the project because they have a lot of other projects as well.

The other reason why Tanglu has problems is too much stuff being centralized on myself. That is a problem I have wanted to rectify for a long time, but as soon as a task wasn't being done in Tanglu because no people were available to do it, I completed it myself. This essentially increased the project's dependency on me as a single person, giving it a really low bus factor. It not only centralizes power in one person (which actually isn't a problem as long as that person is available enough to perform tasks if asked for), it also centralizes knowledge of how to run services and how to do things. And if you want to give up power, people will first need the knowledge of how to perform the specific task (which they will never gain if there's always that one guy doing it). I still haven't found a great way to solve this; it's a problem that essentially kills itself as soon as the project is big enough, but until then, the only way to counter it slightly is to write lots of documentation.

Last year I had way less time to work on Tanglu than the project deserves. I also started to work for Purism on their PureOS Debian derivative (which is heavily influenced by some of the choices we made for Tanglu, but with a different focus; that's probably something for another blog post). A lot of the stuff I do for Purism duplicates the work I do on Tanglu, and also takes away time I have for the project. Additionally, I need to invest a lot more time into other projects such as AppStream, and a lot of random other stuff that just needs continuous maintenance and discussion (especially AppStream eats up a lot of time, since it became really popular in a lot of places). There is also my MSc thesis in neuroscience that requires attention (and is actually in focus most of the time). All in all, I can't split myself, and KDE's cloning machine remains broken, so I can't even use that ;-). In terms of projects, there is also a personal hard limit on how much stuff I can handle, and exceeding it long-term is not very healthy, as in these cases I try to satisfy all projects and in the end do not focus enough on any of them, which leaves me with a lot of half-baked stuff (which helps nobody, and most importantly makes me lose the fun, energy and interest to work on it).

Good news everyone! (sort of) So, this sounded overly negative; where does this leave Tanglu? Fact is, I cannot commit the crazy amounts of time to it that I did in 2013. But I love the project, and I actually do have some time I can put into it. My work on Purism has an overlap with Tanglu, so Tanglu can actually benefit from the software I develop for them, maybe creating a synergy effect between PureOS and Tanglu. Tanglu is also important to me as a testing environment for future ideas (be it in infrastructure or in the "make bundling nice!" department).

So, what actually is the way forward? First, maybe I have the chance to find a few people willing to work on tasks in Tanglu. It's a fun project, and I learned a lot while working on it. Tanglu also possesses some unique properties few other Debian derivatives have, like being built completely from source (allowing us things like swapping core components or compiling with more hardening flags, switching to newer KDE Plasma and GNOME faster, etc.). Second, if we do not have enough manpower, I think converting Tanglu into a rolling-release distribution might be the only viable way to keep the project running. A rolling-release scheme creates much less effort for us than making releases (especially time-based ones!). That way, users will have a constantly updated and secure Tanglu system, with machines doing most of the background work.

If it turns out that absolutely nothing works and we can't attract new people to help with Tanglu, it would mean that there generally isn't much interest from the developer or user side in a project like this, so shutting it down or scaling it down dramatically would be the only option. But I do not think that this is the case, and I believe that having Tanglu around is important. I also have some interesting plans for it which will be fun to implement for testing.

The only thing that had to stop is leaving our users in the dark about what is happening. Sorry for the long post, but there are some subjects which are worth writing more than 140 characters about. If you are interested in contributing to Tanglu, get in touch with us! We have an IRC channel, #tanglu-devel on Freenode (go there for quicker responses!), forums, and mailing lists. It looks like I will be at DebConf this year as well, so you can also catch me there; I might even talk about PureOS/Tanglu infrastructure at the conference.

21 March 2017

Steinar H. Gunderson: 10-bit H.264 support

Following my previous tests of 10-bit H.264, I did some more practical tests; since media.xiph.org is up again, I could test with actual 10-bit input. The results were pretty similar, although of course 4K 60 fps organic content is going to be different at times from the partially rendered 1080p 24 fps clip I used. But I also tested browser support, with good help from people on IRC. It was every bit as bad as I feared: Chrome on desktop (Windows, Linux, macOS) supports 10-bit H.264, although of course without hardware acceleration. Chrome on Android does not. Firefox does not (it tries on macOS, but playback is buggy). iOS does not. VLC does; I didn't try a lot of media players, but obviously ffmpeg-based players should do quite well. I haven't tried Chromecast, but I doubt it works. So I guess that yes, it really is 8-bit H.264 or 10-bit HEVC; but I haven't tested the latter yet either. :-)

9 March 2017

Steinar H. Gunderson: Tired

To be honest, at this stage I'd actually prefer ads in Wikipedia to having ever more intrusive begging for donations. Please go away soon.

27 February 2017

Steinar H. Gunderson: 10-bit H.264 tests

Following the post about 10-bit Y'CbCr earlier this week, I thought I'd do an actual test of 10-bit H.264 compression for live streaming. The basic question is: sure, it's better per bit, but it's also slower, so is it better per MHz? This is largely inspired by Ronald Bultje's post about streaming performance, where he largely showed that HEVC is currently useless for live streaming from software; unless you can encode at x264's veryslow preset (which, at 720p60, means basically rather simple content and 20 cores or so), the best x265 presets you can afford will give you worse quality than the best x264 presets you can afford. My results will maybe not be as scientific, but hopefully still enlightening.

I used the same test clip as Ronald, namely a two-minute clip of Tears of Steel. Note that this is an 8-bit input, so we're not testing the effects of 10-bit input; we're just testing the increased internal precision in the codec. Since my focus is practical streaming, I ran the latest version of x264 with four threads (a typical desktop machine), using one-pass encoding at 4000 kbit/sec. Nageru's speed control has 26 presets to choose from, which gives pretty smooth steps between neighboring ones, but I've been sticking to the ten standard x264 presets (ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow, placebo).

Here's the graph: the x-axis is seconds used for the encode (note the logarithmic scale; placebo takes 200-250 times as long as ultrafast), and the y-axis is SSIM dB, so up and to the left is better. The blue line is 8-bit, and the red line is 10-bit. (I ran most encodes five times and averaged the results, but it doesn't really matter, due to the logarithmic scale.) The results are actually much stronger than I assumed: if you run on (8-bit) ultrafast or superfast, you should stay with 8-bit, but from there on, 10-bit is on the Pareto frontier. Actually, 10-bit veryfast (18.187 dB) is better than 8-bit medium (18.111 dB), while being four times as fast!

But not all of us have a relation to dB quality, so I chose to also do a test that is maybe a bit more intuitive, centered around the bitrate needed for constant quality. I locked quality to 18 dB, i.e., for each preset, I adjusted the bitrate until the SSIM showed 18.000 dB plus/minus 0.001 dB. (Note that this means faster presets get less of a speed advantage, because they need a higher bitrate, which means more time spent entropy coding.) Then I measured the encoding time (again five times) and graphed the results: the x-axis is again seconds, and the y-axis is the bitrate needed in kbit/sec, so lower and to the left is better. Blue is again 8-bit and red is again 10-bit.

If the previous graph was enough to make me intrigued, this one is enough to make me excited. In general, 10-bit gives 20-30% lower bitrate for the same quality and CPU usage! (Compare this with the supposed "up to 50%" benefit of HEVC over H.264, given infinite CPU usage.) The most dramatic example is when comparing the medium presets directly, where 10-bit runs at 2648 kbit/sec versus 3715 kbit/sec (29% lower bitrate!) and is only 5% slower. As one progresses towards the slower presets, the gap narrows somewhat (placebo is 27% slower and only 24% lower bitrate), but in the realistic middle range, the difference is quite marked. If you run 3 Mbit/sec at 10-bit, you get the quality of 4 Mbit/sec at 8-bit. So is 10-bit H.264 a no-brainer? Unfortunately, no; client hardware support is nearly nil.
Not even Skylake, which can do 10-bit HEVC encoding in hardware (and 10-bit VP9 decoding), can do 10-bit H.264 decoding in hardware. Worse still, mobile chipsets generally don't support it. There are rumors that the iPhone 6s supports it, but these are unconfirmed; some Android chips support it, but most don't. I guess this explains a lot of the limited uptake; since it's in some ways a new codec, implementers are more keen to get the full benefits of HEVC instead (even though the licensing situation is really icky). The only ones I know of that have really picked it up as a distribution format are the anime scene, and they're feeling quite specific pains due to unique content (large gradients giving pronounced banding in undithered 8-bit). So, 10-bit H.264: it's awesome, but you can't have it. Sorry :-)

23 February 2017

Steinar H. Gunderson: Fyrrom recording released

The recording of yesterday's Fyrrom (Samfundet's unofficial take on Boiler Room) is now available on YouTube. Five video inputs, four hours, two DJs, no dropped frames. Good times. Soundcloud coming soon!
